enable StaticCache for assisted generation #34797

Open · wants to merge 51 commits into main

Conversation

@yao-matrix (Author)

@gante , I implemented a version for this issue: #32946. Pls help comment, and I can iterate, thx.
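
For context, below is a minimal sketch of the usage this PR aims to enable: assisted (speculative) generation running on top of a static KV cache. The checkpoint names and kwargs are illustrative assumptions, not taken from the PR itself.

```python
# Sketch only: assisted generation combined with cache_implementation="static",
# the combination this PR enables. The checkpoints below are assumed examples.
from transformers import AutoModelForCausalLM, AutoTokenizer

main_ckpt = "meta-llama/Llama-2-7b-hf"                  # assumed main model
assistant_ckpt = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # assumed smaller assistant

tokenizer = AutoTokenizer.from_pretrained(main_ckpt)
model = AutoModelForCausalLM.from_pretrained(main_ckpt)
assistant = AutoModelForCausalLM.from_pretrained(assistant_ckpt)

inputs = tokenizer("Speculative decoding with a static cache:", return_tensors="pt")
out = model.generate(
    **inputs,
    assistant_model=assistant,          # turns on assisted generation
    cache_implementation="static",      # pre-allocated StaticCache
    max_new_tokens=32,
    do_sample=False,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```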

@yao-matrix marked this pull request as draft November 19, 2024 07:28
@yao-matrix marked this pull request as ready for review November 20, 2024 08:02
@yao-matrix (Author)

@gante , could you pls take a look? thx

@zucchini-nlp (Member) left a comment

@yao-matrix hey, gante is currently on a long vacation so I reviewed the PR for him. Thanks for adding support for this, Super cool work!

I left a few comments and also we'll need tests in tests/generation/test_utils.py file. I guess static cache now works with all types of candidate generators right?

Comment on lines 1744 to 1765

```python
if assistant_model is not None:
    # Pre-allocate the assistant model's cache with the same max_cache_len as the main model.
    assistant_model._get_cache(
        cache_implementation=generation_config.cache_implementation,
        batch_size=max(generation_config.num_beams, generation_config.num_return_sequences) * batch_size,
        max_cache_len=max_cache_length,
        device=device,
        model_kwargs=model_kwargs,
    )
```
@zucchini-nlp (Member)

Hmm, I think this will be called on the assistant model anyway when we call assistant.generate(), so there is no need for it here. We only need to remove self.generation_config.cache_implementation = None in the candidate generator.

@yao-matrix (Author)

The thing is: if we leave it to assistant_model.generate() (called from get_candidates) to set up the cache, then on the first call max_new_tokens is set to max_new_tokens = min(int(self.num_assistant_tokens), self.generation_config.max_length - new_cur_len - 1), so the cache length gets set to int(self.num_assistant_tokens) + prompt_len. That is less than the actually needed cache length of max_token_length + prompt_length, and it leads to an assert during generation. So the key point here is that the assistant model's cache length should be the same as the main model's. I also noticed that this function already takes assistant_model as an argument but doesn't use it; I think it may be there for cases like this. That's the rationale behind this change.
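
To make the mismatch concrete, here is a small back-of-the-envelope sketch (hypothetical numbers, not from the PR) of why sizing the cache inside get_candidates is not enough:

```python
# Hypothetical numbers illustrating the cache-length mismatch described above.
prompt_len = 32
max_new_tokens_total = 128      # what the user asked the main model for
num_assistant_tokens = 5        # tokens the assistant speculates per round

# If the assistant's StaticCache were allocated on the first get_candidates()
# call, it would only cover one speculation window ...
cache_len_if_sized_lazily = prompt_len + num_assistant_tokens    # 37
# ... while later rounds need room for the whole generation.
cache_len_actually_needed = prompt_len + max_new_tokens_total    # 160

assert cache_len_if_sized_lazily < cache_len_actually_needed
# Writing past slot 37 of a pre-allocated StaticCache would assert out, which is
# why the assistant cache is pre-allocated up front with the main model's max_cache_len.
```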

@zucchini-nlp (Member)

Oh, I see, that makes sense. Then we can leave the cache init here.

Outdated review comments on src/transformers/generation/candidate_generator.py and src/transformers/cache_utils.py (two threads) were marked as resolved.
@zucchini-nlp (Member) left a comment

LGTM! We need some tests, and then I'll request a review from the core maintainer; after that we can merge.

@yao-matrix (Author)

> @yao-matrix hey, gante is currently on a long vacation so I reviewed the PR for him. Thanks for adding support for this, Super cool work!
>
> I left a few comments and also we'll need tests in tests/generation/test_utils.py file. I guess static cache now works with all types of candidate generators right?

@zucchini-nlp, the test_utils CI pass rate is the same before and after this PR, as shown below, so no regressions are introduced.
before:
=========================== short test summary info ============================
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_assisted_decoding_encoder_decoder_shared_encoder
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_assisted_decoding_num_assistant_tokens_heuristic_schedule
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_assisted_generation_early_exit
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_custom_logits_processor
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_default_max_length_warning
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_eos_token_id_int_and_list_beam_search
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_eos_token_id_int_and_list_top_k_top_sampling
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_generate_compile_fullgraph_tiny
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_generated_length_assisted_generation
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_max_new_tokens_encoder_decoder
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_min_length_if_input_embeds
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_model_kwarg_assisted_decoding_decoder_only
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_model_kwarg_assisted_decoding_encoder_decoder
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_model_kwarg_encoder_signature_filtering
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_prepare_inputs_for_generation_decoder_llm
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_speculative_decoding_equals_regular_decoding
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_stop_sequence_stopping_criteria
====== 17 failed, 51 passed, 19 skipped, 13 warnings in 133.78s (0:02:13) ======

after:
=========================== short test summary info ============================
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_assisted_decoding_encoder_decoder_shared_encoder
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_assisted_decoding_num_assistant_tokens_heuristic_schedule
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_assisted_generation_early_exit
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_custom_logits_processor
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_default_max_length_warning
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_eos_token_id_int_and_list_beam_search
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_eos_token_id_int_and_list_top_k_top_sampling
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_generate_compile_fullgraph_tiny
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_generated_length_assisted_generation
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_max_new_tokens_encoder_decoder
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_min_length_if_input_embeds
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_model_kwarg_assisted_decoding_decoder_only
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_model_kwarg_assisted_decoding_encoder_decoder
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_model_kwarg_encoder_signature_filtering
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_prepare_inputs_for_generation_decoder_llm
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_speculative_decoding_equals_regular_decoding
FAILED tests/generation/test_utils.py::GenerationIntegrationTests::test_stop_sequence_stopping_criteria
====== 17 failed, 51 passed, 19 skipped, 13 warnings in 133.78s (0:02:13) ======

@yao-matrix (Author)

> LGTM! We need some tests and then I am requesting review from the core maintainer, after that we can merge

thx for reviewing.

@zucchini-nlp (Member)

@yao-matrix no worries if some tests are failing and they are not related to the PR changes; they might just be flaky or will be fixed on main by us. From what I see, the only CI test affected by the PR is the one below, plus we need to see if the new test passes for all models:

tests/models/gemma2/test_modeling_gemma2.py::Gemma2ModelTest::test_assisted_decoding_with_num_logits_to_keep

@yao-matrix (Author)

@zucchini-nlp , any more comments for me to iterate? Thx.

@zucchini-nlp (Member)

@yao-matrix no, the only remaining thing is the CI, which is failing right now. I pointed out the relevant test in the previous comment, and it would be good if you could add one more test in tests/generation/test_utils.py that exercises static cache with assisted generation. That is all, actually.

At the end you will need to run make style to pass the codestyle CI check. Feel free to tag the core maintainer @ArthurZucker for review when the tests are ready and CI is green, or tag me if you need help / have questions :)
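
A minimal sketch of the kind of test being requested here (not the exact test that was added; the tiny checkpoint and kwargs are assumptions) could compare assisted generation with and without a static cache:

```python
# Sketch of a possible tests/generation/test_utils.py addition: greedy assisted
# generation should produce the same tokens with a dynamic or a static cache.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def test_assisted_generation_static_cache_matches_dynamic():
    ckpt = "hf-internal-testing/tiny-random-LlamaForCausalLM"  # assumed tiny model
    tokenizer = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForCausalLM.from_pretrained(ckpt)
    assistant = AutoModelForCausalLM.from_pretrained(ckpt)
    inputs = tokenizer("The quick brown fox", return_tensors="pt")

    common = dict(assistant_model=assistant, max_new_tokens=10, do_sample=False)
    out_dynamic = model.generate(**inputs, **common)
    out_static = model.generate(**inputs, cache_implementation="static", **common)

    # Identical outputs regardless of cache implementation.
    assert torch.equal(out_dynamic, out_static)
```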

Comment on lines 116 to 121

```python
@parameterized.expand([(None, True), ("static", False)])
def test_assisted_decoding_with_num_logits_to_keep(self, cache_implementation, return_legacy_cache):
    if cache_implementation == "static":
        self.skipTest("Gemma2 has HybridCache which is not compatible with assisted decoding StaticCache")
    pass
```

@zucchini-nlp (Member)

Let's not skip entirely, but only the static-cache case, as we still need to check whether assisted generation works in Gemma2 :)

Maybe it will be skipped by the model's _supports_static_cache as I've commented above, but if not, we can skip only test_assisted_decoding_with_num_logits_to_keep_1_static (it may be named a bit differently).

@yao-matrix (Author)

I switched to _supports_static_cache to skip the case. For Gemma it's a bit different: it uses HybridCache but still claims _supports_static_cache = True, so I skip it in the model test file anyway. I will remove this skip after enabling HybridCache for assisted decoding, which I plan to do after this PR (pure StaticCache) is merged, thanks.
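
For illustration, the skip described above could look roughly like this (a hedged sketch, not the exact diff in the PR):

```python
# Sketch: skip the static-cache variant for models that do not support StaticCache.
if cache_implementation == "static" and not model_class._supports_static_cache:
    self.skipTest(f"{model_class.__name__} does not support StaticCache")
```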

@ArthurZucker (Collaborator) left a comment

Looks very nice, but we need to add a compile test to make sure this is compile compatible! The whole point of static cache is -> compile! 🤗

@yao-matrix (Author) commented Dec 11, 2024

> Looks very nice, but we need to add a compile test to make sure this is compile compatible! The whole point of static cache is -> compile! 🤗

@ArthurZucker I added a test_assisted_decoding_compile case based on test_generate_compile; the forward_only test passes for llama, while the end_to_end test fails for the same reason Joao commented on in test_generate_compile.
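
Roughly, the forward-only compile check mirrors the following sketch (illustrative tiny checkpoint; not the exact test_assisted_decoding_compile code):

```python
# Sketch: compile the main model's forward, then run assisted generation with a
# static cache on top of it. The checkpoint is an assumed tiny test model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ckpt = "hf-internal-testing/tiny-random-LlamaForCausalLM"  # assumed
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt)
assistant = AutoModelForCausalLM.from_pretrained(ckpt)

model.forward = torch.compile(model.forward)  # forward-only compilation

inputs = tokenizer("compile me", return_tensors="pt")
_ = model.generate(
    **inputs,
    assistant_model=assistant,
    cache_implementation="static",
    max_new_tokens=8,
    do_sample=False,
)
```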

@yao-matrix (Author) commented Dec 13, 2024

@ArthurZucker @zucchini-nlp, please let me know if there are any further comments, thanks. BTW, I checked the failed CI case; it is not related to my changes.

@zucchini-nlp (Member)

Thanks, re-triggered the tests, let's wait for the core maintainer

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@yao-matrix (Author)

@ArthurZucker, @zucchini-nlp, I am wondering whether we could still land this PR in 2024 :)

@yao-matrix (Author)

@zucchini-nlp @ArthurZucker , any further comments on this?
